Add documentation for rule-based anomaly detection and imputation #8202
This PR introduces new documentation for rule-based anomaly detection (AD) and imputation options, providing detailed guidance on configuring these features. It also updates the maximum shingle size information and enhances the documentation for window delay settings.

Testing done:
- Successfully ran Jekyll build and reviewed the updated documentation to ensure all changes are correctly displayed.

Signed-off-by: Kaituo Li <[email protected]>
Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged. Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer. When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review.
Signed-off-by: Melissa Vagi <[email protected]>
Copy edit documentation Signed-off-by: Melissa Vagi <[email protected]>
Doc review complete Signed-off-by: Melissa Vagi <[email protected]>
Doc review complete. Moving to editorial review.
_observing-your-data/ad/index.md
Outdated
Anomaly detection automatically detects anomalies in your OpenSearch data in near real-time using the Random Cut Forest (RCF) algorithm. RCF is an unsupervised machine learning algorithm that models a sketch of your incoming data stream to compute an `anomaly grade` and `confidence score` value for each incoming data point. These values are used to differentiate an anomaly from normal variations. For more information about how RCF works, see [Random Cut Forests](https://www.semanticscholar.org/paper/Robust-Random-Cut-Forest-Based-Anomaly-Detection-on-Guha-Mishra/ecb365ef9b67cd5540cc4c53035a6a7bd88678f9). | ||
Anomaly detection automatically detects anomalies in your OpenSearch data in near real time using the Random Cut Forest (RCF) algorithm. RCF is an unsupervised machine learning algorithm that models a sketch of your incoming data stream to compute an _anomaly grade_ and _confidence score_ value for each incoming data point. These values are used to differentiate an anomaly from normal variations. For more information about how RCF works, see [Random Cut Forests](https://www.semanticscholar.org/paper/Robust-Random-Cut-Forest-Based-Anomaly-Detection-on-Guha-Mishra/ecb365ef9b67cd5540cc4c53035a6a7bd88678f9). |
Last sentence: Rather than "Random Cut Forests", it looks like the title of the page is "Robust Random Cut Forest Based Anomaly Detection on Streams".
_observing-your-data/ad/index.md
Outdated
## Step 1: Define a detector
A detector is an individual anomaly detection task. You can define multiple detectors, and all the detectors can run simultaneously, with each analyzing data from different sources. | ||
A _detector_ is an individual anomaly detection task. You can define multiple detectors, and all detectors can run simultaneously, with each analyzing data from different sources. |
Let's add a sentence here introducing the list.
_observing-your-data/ad/index.md
Outdated
A _detector_ is an individual anomaly detection task. You can define multiple detectors, and all detectors can run simultaneously, with each analyzing data from different sources. | ||
1. On the **Anomaly detection** page, select the **Create detector** button. | ||
2. On the **Define detector** page, enter the required information on the **Detector details** pane. |
2. On the **Define detector** page, enter the required information on the **Detector details** pane.
2. On the **Define detector** page, enter the required information in the **Detector details** pane.
_observing-your-data/ad/index.md
Outdated
1. On the **Anomaly detection** page, select the **Create detector** button. | ||
2. On the **Define detector** page, enter the required information on the **Detector details** pane. | ||
3. On the **Select data** pane, specify the data source by choosing a source from the **Index** dropdown menu. You can choose an index, index patterns, or alias. |
3. On the **Select data** pane, specify the data source by choosing a source from the **Index** dropdown menu. You can choose an index, index patterns, or alias.
3. In the **Select data** pane, specify the data source by choosing a source from the **Index** dropdown menu. You can choose an index, index patterns, or an alias.
_observing-your-data/ad/index.md
Outdated
#### Example filter using query DSL | ||
The query is designed to retrieve documents in which the `urlPath.keyword` field matches one of the following specified values: | ||
The following example query retrieves documents where the `urlPath.keyword` field matches any of the specified values: |
The following example query retrieves documents where the `urlPath.keyword` field matches any of the specified values:
The following example query retrieves documents in which the `urlPath.keyword` field matches any of the specified values:
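For readers following this thread without the full page open, here is a minimal Python sketch of the kind of filter the sentence describes. It assumes the filter is a `terms` query wrapped in a `bool` filter clause; the helper name `build_url_path_filter` and the two path values are hypothetical and are not taken from the PR.

```python
import json


def build_url_path_filter(paths):
    """Build a query DSL filter that matches any of the given urlPath.keyword values.

    Assumption: a `terms` query inside a `bool` filter clause is one way to
    express "matches any of the specified values".
    """
    return {
        "bool": {
            "filter": [
                {"terms": {"urlPath.keyword": list(paths)}}
            ]
        }
    }


# Hypothetical path values, used only for illustration.
example_filter = build_url_path_filter(
    ["/domain/{id}/short", "/sub_dir/{id}/short"]
)
print(json.dumps(example_filter, indent=2))
```

The printed JSON could be pasted into a custom-expression filter field, assuming that field accepts a query DSL object.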
You can see the following additional fields: | ||
Note that the result includes the following additional field: |
Note that the result includes the following additional field:
Note that the result includes the following additional field.
At times, the detector might detect an anomaly late. | ||
Let's say the detector sees a random mix of the triples {1, 2, 3} and {2, 4, 5} that correspond to `slow weeks` and `busy weeks`, respectively. For example 1, 2, 3, 1, 2, 3, 2, 4, 5, 1, 2, 3, 2, 4, 5, ... and so on. | ||
If the detector comes across a pattern {2, 2, X} and it's yet to see X, the detector infers that the pattern is anomalous, but it can't determine at this point which of the 2's is the cause. If X = 3, then the detector knows it's the first 2 in that unfinished triple, and if X = 5, then it's the second 2. If it's the first 2, then the detector detects the anomaly late. | ||
The detector may detect an anomaly late. For example, the detector observes a sequence of data that alternates between "slow weeks" (represented by the triples {1, 2, 3}) and "busy weeks" (represented by the triples {2, 4, 5}). If the detector comes across a pattern {2, 2, X}, where it has not yet seen the value that X will take, the detector infers that the pattern is anomalous. However, it cannot determine which of the 2's is the cause. If X = 3, then the first 2 is the anomaly. If X = 5, then the second 2 is the anomaly. If it is the first 2, then the detector would detect the anomaly late. |
The detector may detect an anomaly late. For example, the detector observes a sequence of data that alternates between "slow weeks" (represented by the triples {1, 2, 3}) and "busy weeks" (represented by the triples {2, 4, 5}). If the detector comes across a pattern {2, 2, X}, where it has not yet seen the value that X will take, the detector infers that the pattern is anomalous. However, it cannot determine which of the 2's is the cause. If X = 3, then the first 2 is the anomaly. If X = 5, then the second 2 is the anomaly. If it is the first 2, then the detector would detect the anomaly late. | |
The detector may be late in detecting an anomaly. For example: The detector observes a sequence of data that alternates between "slow weeks" (represented by the triples {1, 2, 3}) and "busy weeks" (represented by the triples {2, 4, 5}). If the detector comes across a pattern {2, 2, X}, where it has not yet seen the value that X will take, then the detector infers that the pattern is anomalous. However, it cannot determine which 2 is the cause. If X = 3, then the first 2 is the anomaly. If X = 5, then the second 2 is the anomaly. If it is the first 2, then the detector will be late in detecting the anomaly. |
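The {2, 2, X} reasoning in the paragraph above can be checked with a small toy sketch. This is not how RCF works internally. It only compares a completed triple against the two known patterns and reports which position deviates. The names `SLOW_WEEK`, `BUSY_WEEK`, and `locate_anomaly` are hypothetical.

```python
SLOW_WEEK = (1, 2, 3)
BUSY_WEEK = (2, 4, 5)


def locate_anomaly(triple, patterns=(SLOW_WEEK, BUSY_WEEK)):
    """Return the index of the single value that deviates from the closest pattern.

    Returns None when the triple matches a pattern exactly or when more than
    one position deviates (the toy model cannot attribute those cases).
    """
    # Pick the known pattern that differs from the triple in the fewest positions.
    closest = min(patterns, key=lambda p: sum(a != b for a, b in zip(triple, p)))
    mismatches = [i for i, (a, b) in enumerate(zip(triple, closest)) if a != b]
    return mismatches[0] if len(mismatches) == 1 else None


# X = 3: the triple is closest to a slow week, so the first 2 is the anomaly.
print(locate_anomaly((2, 2, 3)))
# X = 5: the triple is closest to a busy week, so the second 2 is the anomaly.
print(locate_anomaly((2, 2, 5)))
```

Until X arrives, both attributions remain possible, which is why an anomaly at the first position can only be confirmed two steps later.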
If a detector detects an anomaly late, the result has the following additional fields: | ||
When a detector detects an anomaly late, the result includes the following additional fields: |
When a detector detects an anomaly late, the result includes the following additional fields:
When a detector is late in detecting an anomaly, the result includes the following additional fields.
Field | Description
:--- | :---
`past_values` | The actual input that triggered an anomaly. If `past_values` is null, the attributions or expected values are from the current input. If `past_values` is not null, the attributions or expected values are from a past input (for example, the previous two steps of the data [1,2,3]).
`approx_anomaly_start_time` | The approximate time of the actual input that triggers an anomaly. This field helps you understand when a detector flags an anomaly. Both single-stream and high-cardinality detectors don't query previous anomaly results because these queries are expensive operations. The cost is especially high for high-cardinality detectors that might have a lot of entities. If the data is not continuous, the accuracy of this field is low and the actual time that the detector detects an anomaly can be earlier.
`past_values` | The actual input that triggered an anomaly. If `past_values` is null, then the attributions or expected values are from the current input. If `past_values` is not null, then the attributions or expected values are from a past input (for example, the previous two steps of the data [1,2,3]).
Should both instances of "null" be in code font?
`past_values` | The actual input that triggered an anomaly. If `past_values` is null, the attributions or expected values are from the current input. If `past_values` is not null, the attributions or expected values are from a past input (for example, the previous two steps of the data [1,2,3]). | ||
`approx_anomaly_start_time` | The approximate time of the actual input that triggers an anomaly. This field helps you understand when a detector flags an anomaly. Both single-stream and high-cardinality detectors don't query previous anomaly results because these queries are expensive operations. The cost is especially high for high-cardinality detectors that might have a lot of entities. If the data is not continuous, the accuracy of this field is low and the actual time that the detector detects an anomaly can be earlier. | ||
`past_values` | The actual input that triggered an anomaly. If `past_values` is null, then the attributions or expected values are from the current input. If `past_values` is not null, then the attributions or expected values are from a past input (for example, the previous two steps of the data [1,2,3]). | ||
`approx_anomaly_start_time` | The approximate time of the actual input that triggers an anomaly. This field helps you understand when a detector flags an anomaly. Both single-stream and high-cardinality detectors do not query previous anomaly results because these queries are costly operations. The cost is especially high for high-cardinality detectors that may have many entities. If the data is not continuous, then the accuracy of this field is low and the actual time that the detector detects an anomaly can be earlier. |
`approx_anomaly_start_time` | The approximate time of the actual input that triggers an anomaly. This field helps you understand when a detector flags an anomaly. Both single-stream and high-cardinality detectors do not query previous anomaly results because these queries are costly operations. The cost is especially high for high-cardinality detectors that may have many entities. If the data is not continuous, then the accuracy of this field is low and the actual time that the detector detects an anomaly can be earlier. | |
`approx_anomaly_start_time` | The approximate time of the actual input that triggered an anomaly. This field helps you understand the time at which a detector flags an anomaly. Both single-stream and high-cardinality detectors do not query previous anomaly results because these queries are costly operations. The cost is especially high for high-cardinality detectors that may have many entities. If the data is not continuous, then the accuracy of this field is low and the actual time at which the detector detects an anomaly can be earlier. |
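As a reading aid for the two field descriptions, the following Python sketch shows how a consumer of anomaly results might apply the null check on `past_values`. The function name, the sample documents, and the timestamp value are hypothetical. Only the field names `past_values` and `approx_anomaly_start_time` come from the text under review.

```python
def attribution_source(result):
    """Say whether attributions and expected values refer to the current or a past input.

    Assumption: `result` is a dict parsed from an anomaly result document, and
    a missing `past_values` key is treated the same as a null value.
    """
    if result.get("past_values") is None:
        return "current input"
    return "past input"


# Hypothetical result documents; values are illustrative only.
on_time = {"anomaly_grade": 0.7, "past_values": None}
late = {
    "anomaly_grade": 0.7,
    "past_values": [{"feature_id": "f1", "data": 2.0}],
    "approx_anomaly_start_time": 1700000000000,
}

print(attribution_source(on_time))  # current input
print(attribution_source(late))     # past input
```

When the source is a past input, `approx_anomaly_start_time` is the field to consult for the approximate time of that input, keeping in mind the accuracy caveat for non-continuous data.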
@natebower @kaituo I've addressed the editorial feedback and revised text that had comments. Do you want to give it another read?
_observing-your-data/ad/index.md
Outdated
1. Choose **Next**. | ||
Using these options can improve recall in anomaly detection. For instance, if you are monitoring for drops in event counts, including both partial and complete drops, then filling missing values with zeros helps detect significant data absences, improving detection recall. | ||
Be cautious when imputing extensively missing data, as excessive gaps can compromise model accuracy. Quality input is critical---poor data quality leads to poor model performance. You can check whether a feature value has been imputed using the `feature_imputed` field in the anomaly results index. See [Anomaly result mapping]({{site.url}}{{site.baseurl}}/monitoring-plugins/ad/result-mapping/) for more information. |
Can you also add "The confidence score decreases when imputations occur."? So there are two signals from imputation: feature_imputed field and confidence score.
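To make the zero-fill option and the imputation signal concrete, here is a minimal Python sketch. It assumes missing intervals appear as `None` in a list of event counts. The helper names are hypothetical, and the per-value flags only mirror the idea behind `feature_imputed`, not its actual schema in the results index.

```python
def impute_with_zeros(values):
    """Replace missing values (None) with 0 and record which positions were imputed."""
    filled = [0 if v is None else v for v in values]
    imputed = [v is None for v in values]
    return filled, imputed


def imputed_fraction(flags):
    """Return the share of imputed values, a rough indicator of input quality."""
    return sum(flags) / len(flags) if flags else 0.0


# Hypothetical event counts with two missing intervals.
counts = [5, None, 7, None]
filled, flags = impute_with_zeros(counts)
print(filled)                   # [5, 0, 7, 0]
print(flags)                    # [False, True, False, True]
print(imputed_fraction(flags))  # 0.5
```

A high imputed fraction is the situation the cautionary paragraph warns about: the zeros make complete drops visible, but a window made up mostly of imputed values says little about the real data.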
…ensearch-project#8202) * Add documentation for rule-based anomaly detection and imputation This PR introduces new documentation for rule-based anomaly detection (AD) and imputation options, providing detailed guidance on configuring these features. It also updates the maximum shingle size information and enhances the documentation for window delay settings. Testing done: - Successfully ran Jekyll build and reviewed the updated documentation to ensure all changes are correctly displayed. Signed-off-by: Kaituo Li <[email protected]> * Doc review Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md 
Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/result-mapping.md Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md Signed-off-by: Melissa Vagi <[email protected]> * Update index.md Copy edit documentation Signed-off-by: Melissa Vagi <[email protected]> * Update result-mapping.md Doc review complete Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md Signed-off-by: Melissa Vagi <[email protected]> * Fix links Signed-off-by: Melissa Vagi <[email protected]> * Fix links Signed-off-by: Melissa Vagi <[email protected]> * Address editorial feedback Signed-off-by: Melissa Vagi <[email protected]> * Address editorial feedback Signed-off-by: Melissa Vagi <[email protected]> * Update _observing-your-data/ad/index.md Signed-off-by: Melissa Vagi <[email protected]> --------- Signed-off-by: Kaituo Li <[email protected]> Signed-off-by: Melissa Vagi <[email protected]> Co-authored-by: Melissa Vagi <[email protected]> Signed-off-by: Noah Staveley <[email protected]>
Issues Resolved
closes #8169
Version
2.17+